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Abstract. We provide a remedy for two concerns that have dogged 
the use of principal components in regression: (i) principal components 
are computed from the predictors alone and do not make apparent 
use of the response, and (ii) principal components are not invariant 
or equivariant under full rank linear transformation of the predictors. 
The development begins with principal fitted components [Cook, R. 
D. (2007). Fisher lecture: Dimension reduction in regression (with dis- 
cussion). Statist. Sci. 22 1-26] and uses normal models for the inverse 
regression of the predictors on the response to gain reductive infor- 
mation for the forward regression of interest. This approach includes 
methodology for testing hypotheses about the number of components 
and about conditional independencies among the predictors. 

Key words and phrases: Central subspace, dimension reduction, in- 
verse regression, principal components. 



1. INTRODUCTION 

Principal components have a long history of use 
as a dimension reduction method in regression, and 
today are widely represented in the applied sciences. 
The basic idea is to replace the predictor vector 
X G ]RP with a few of the principal components vJX, 
j = 1,. . . ,p, prior to regression with response Y G 
M}, where Vj is the eigenvector of the sample co- 
variance matrix 5] of X corresponding to its jth. 
largest eigenvalue. The leading components, those 
corresponding to the larger eigenvalues, are typically 
chosen. Collinearity is the main and often the only 
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motivation for the use of principal components in 
regression, but our results show no necessary link 
between the presence of collinearity and the effec- 
tiveness of principal component reduction. 

Although collinearity is perhaps the primary his- 
torical reason for interest in principal components, 
they have been widely used in recent years for di- 
mension reduction in regressions where the sample 
size n is less than p. This motivation is prevalent 
in the genomics literature (see, e.g., Bura and Pfeif- 
fer, 2003, and Li and Li, 2004) and is an impor- 
tant ingredient in the method of supervised prin- 
cipal components by Bair et al. (2006). Principal 
components can also be useful regardless of the pres- 
ence of collinearity or the relationship between n 
and p, depending on the goals of the analysis. For 
instance, it is often desirable to have an informative 
low-dimensional graphical representation of the data 
to facilitate model building and aid understanding 
(Cook, 1998). If p < 2 we can use computer graphics 
to view the data in full. If p = 3 and the response 
is categorical we can use a three-dimensional plot of 
the predictors with points marked according to the 
value of the response to view the full data. However, 
if p > 3 or p = 3 and the response is continuous we 
cannot view the data in full and dimension reduction 
may be useful. 
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Two general concerns have dogged the use of prin- 
cipal components. The first is that principal compo- 
nents are computed from the marginal distribution 
of X and there is no reason in principle why the 
leading principal components should carry the es- 
sential information about Y (Cox, 1968). The sec- 
ond is that principal components are not invariant 
or equivariant under full rank linear transformations 
of X, leading to problems in practice when the pre- 
dictors are in different scales or have different vari- 
ances. Some authors standardize the predictors so 
S is a correlation matrix prior to computing the 
components. 

In this article we propose a model-based approach 
to principal component reduction that can be adapt- 
ed to a specific response Y and that is equivariant 
under full rank linear transformations of X. Our re- 
sults build on Cook's (2007) formulation, which uses 
model-based inverse regression of X on y to gain re- 
ductive information for the forward regression of Y 
on X. In Section 2 we introduce the models: Cook's 
models are reviewed in Section 2.1 and our extension 
is described in Section 2.2. We address estimation in 
Section 3 and relationships with other methods in 
Section 4. Inference is considered in Sections 5 and 
6, and Section 7 contains illustrations of the pro- 
posed methodology. We regard the developments to 
this point as perhaps the most useful in practice. 
Nevertheless, in Section 8 we discuss model varia- 
tions that may also be useful. Proofs can be found 
in the Appendices. 

2. MODELS 

We assume that the data consist of n independent 
observations on (X, y). Let X^ denote a random 
vector distributed as X|(y = y), and assume that 
Xy is normally distributed with mean fXy and con- 
stant variance A > 0. Let p, = E(X) and let Sr = 
spa,n{fXy — fi\y £ Sy}, where Sy denotes the sample 
space of Y and F € M^^'^ denotes a semi-orthogonal 
matrix whose columns form a basis for the 
d-dimensional subspace Sr- Then we can write 

(1) X,y = /i + Ti/y + e, 

where e is independent of Y and normally distributed 
with mean and covariance matrix A, and fy = 
V'^{fiy — fi) G M'^; we assume that var(i/y) > 0. This 
model represents the fact that the translated condi- 
tional means Hy — p, fall in the (i-dimensional sub- 
space Sr . In full generality, a reduction T -.W ^ 



IK'^, Q ^P, is sufficient if y|X ~ y|T(X), or equiv- 
alently if y _IL X|T(X), since X can then be re- 
placed by T(X) without loss of information on the 
regression. Under model (1) the specific reduction 
i?(X) = r^A-^X is sufficient (Cook, 2007) and the 
goal is to estimate the dimension reduction subspace 
A-^Sr = {A-^z:z G St}, since T is not generally 
identified. Then i?(X) = ry-^X is a sufficient reduc- 
tion for any matrix r] ^W^'^ whose columns form a 
basis for A.~^Sy- The parameter space for A~^5r 
and Sr is the d-dimensional Grassmann manifold 
G(d,p) ill The manifold G(d,p) has analytic dimen- 
sion d{p — d) (Chikuse, 2003, page 9), which is the 
number of reals needed to specify uniquely a sin- 
gle subspace in G{d,p)- This count will be used later 
when determining degrees of freedom. 

Let 5f;(A,B) denote the span of A~-^/^ times the 
first d eigenvectors of A~^/^BA~^/^, where A and 
B are symmetric matrices and, as used in this arti- 
cle, A will always be a nonsingular covariance ma- 
trix. Beginning with B we apply the transforma- 
tion A~^/^ before computing the first d eigenvectors. 
Multiplying these eigenvectors by A^^^"^ then con- 
verts them to vectors that span the desired subspace 
in the original scale. The subspace 5d(A, B) can also 
be described as the span of the first d eigenvectors 
of B relative to A. This notation is intended as a 
convenient way of describing estimators of A~^5r 
under various conditions. 

2.1 PC and Isotonic PFC Models 

Cook (2007) developed estimation methods for two 
special cases of model (1). In the first, Uy is unknown 
for all y £ Sy but the errors are isotonic; that is, 
A = cr^Ip. This is called the PC model since the 
maximum likelihood estimator (MLE) of A"^5r = 
Sr is 5rf(Ip, S), the span of the first d eigenvectors 
of S. Thus -R(X) is estimated by the first d principal 
components. This relatively simple result arises be- 
cause of the nature A. Since the errors are isotonic, 
the contours of A are circular. When the signal Tvy 
is added the contours of I] = T var (i/y )r"^ -|- a'^Ip be- 
come ^-dimensional ellipses with their longest d axes 
spanning Sr ■ 

In the second version of model (1), the coordi- 
nate vectors are modeled as Uy = (3{{y — E(fy)}, 
where fy £ MJ" is a known vector- valued function 
of y with linearly independent elements and /3 G 
M'^^'^', d < min(r,p), is an unrestricted rank d ma- 
trix. Under this model for i/„ each coordinate Xyi, 
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j = 1, . . . ,p, of Xy follows a linear model with pre- 
dictor vector iy. Consequently, we are able to use 
inverse response plots (Cook, 1998, Chapter 10) of 
Xyj versus y, j = 1, . . . ,p, to gain information about 
suitable choices for f^, which is an ability that is not 
generally available in the forward regression of Y 
on X. For example, Bura and Cook (2001), Figure 
lb, present a scatterplot matrix of the five variables 
in a regression with four predictors. The inverse re- 
sponse plots can all be fitted reasonably with log(y), 
indicating that in their example iy = log{y) may be 
adequate. In some regressions there may be a nat- 
ural choice for fy. Suppose for instance that Y is 
categorical, taking values in one of h categories Ck, 
k = 1, . . . ,h. We can then set r = h — 1 and spec- 
ify the kth element of iy to be J{y £ Ck), where J 
is the indicator function. When Y is continuous we 
can consider f^'s that contain a reasonably flexible 
set of basis functions, like polynomial terms in Y , 
which may be useful when it is impractical to apply 
graphical methods to all of the predictors. Another 
option consists of "slicing" the observed values of Y 
into h bins (categories) C^, k = 1, . . . ,h, and then 
specifying the kth coordinate of fy as for the case of 
a categorical Y. This has the effect of approximat- 
ing each conditional mean E{Xyj) as a step function 
of y with h steps, 

h~l 

E{Xyj) -fij + Y. 7jbfc{ J(y G Ck) - Pr(F G C,,)}, 
k=l 

where "fj is the jth row of T and is the kth 
column of (3. Piecewise polynomials could also be 
used. 

Models with Uy = fi{fy — E(fy)} are called princi- 
pal fitted component (PFC) models. When the errors 
are isotonic the MLE of 5r is 5d(Ip, Sfit), where £fit 
is the sample covariance matrix of the fitted vectors 
from the multivariate linear regression of Xy on f^, 
including an intercept. In more detail, let X denote 
the nxp matrix with rows (X^ — X)"^ and let F de- 
note the n X r matrix with rows (fy-f)^. Then the 
nx p matrix of fitted vectors from the regression of 
Xy on fy is X = PfX and Sgt = X^PfX/?i, where 
Pf denotes the linear operator that projects onto 
the subspace spanned by the columns of F. Under 
this isotonic PFC model the MLE of i?(X) consists 
of the first d principal fitted components, with eigen- 
vectors computed from instead of S = X-'^X/n. 
The covariance matrix of the residual vectors from 
the fit of Xy on iy can be represented as Sres = 



^ — ^fit = X-'" QfX / n , where Qf = In — Pf • This ma- 
trix plays no direct role in the isotonic PFC model, 
since we have specified A = u^Ip. However, Sires will 
play a role in the extensions that follow. 

2.2 The PFC Model 

Principal fitted components are an adaptation of 
principal components to a particular response Y. 
However, the isotonic error structure A = a'^lp is 
restrictive and does not address the invariance issue. 
In this article we extend principal fitted components 
to allow for a general error structure. Our specific 
goal is to develop maximum likelihood estimation 
of A~^5r and related inference methods under the 
following PFC model, 

(2) Xy = fi + T(3{iy-E{{Y)} + e = fi + T(3iy + e, 

where /x = /i - r/3E(fy) and var(5:) = A > 0. This 
approach will then provide a solution to the two 
long-standing issues that have plagued the applica- 
tion of principal components. Assuming that Sires > 
0, we will show that the MLE of the sufficient re- 
duction i?(X) can be computed straightforwardly as 
the first d principal components based on the stan- 
dardized predictor vector 5]res~^^'^X. We found this 
result to be surprising since it does not depend ex- 
plicitly on the MLE of A, which is a necessary in- 
gredient for the MLE of A~^5r. 

In Section 3 we give the MLE of A and five equiv- 
alent forms for the MLE of A~^5r under model 
(2). Relationships with some other methods are dis- 
cussed in Section 4. We turn to inference in Sec- 
tions 5 and 6, presenting ways of inferring about d 
and about the active predictors. In Section 8 we dis- 
cuss versions of model (2) in which A is structured, 
providing modeling possibilities between the PFC 
models with A = cr^Ip and A > 0. 

2.3 Identifying A~^5r as the Central Subspace 

In this section we provide a connection between 
the model-based dimension reduction considered in 
this article and the theory of model-free sufficient 
dimension reduction. 

In the model-based approach, sufficient reductions 
i?(X) can in principle be determined from the model 
itself. For example, in the case of model (1) we saw 
previously that R{X) = F^A-^X is a sufficient re- 
duction. In model- free dimension reduction there is 
no specific law to guide the choice of a reduction. 
However, progress is still possible by restricting at- 
tention to the class of linear reductions. A linear 
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reduction always exists since -R(X) = X is trivially 
sufficient. If -R(X) = rj'^'K is a /c-dimensional suffi- 
cient reduction, then so is i?(X) = {r/A)^^ for any 
k X k full rank matrix A. Consequently, only the 
subspace span(77) spanned by the columns of rj can 
be identified — span(r/) is called a dimension reduc- 
tion subspace. 

If span(77) is a dimension reduction subspace then 
so is span(?7, 77;^) for any matrix p x ki matrix rji. 
As a result there may be many linear sufficient re- 
ductions in a particular regression and we seek the 
one with the smallest dimension. If span(r/]^) and 
span(772) are both dimension reduction subspaces, 
then under mild conditions (Cook, 1998) so is 
span(r7]^) n span(?72). Consequently, the inferential 
target in model-free sufficient dimension reduction 
is often taken to be the central subspace 5y|x, de- 
fined as the intersection of all dimension reduction 
subspaces (Cook, 1994, 1998). 

The following theorem enables us to conclude that 
under model (2) A~^5r = '5y|x; that is, the infer- 
ential targets for model-free and model-based reduc- 
tions coincide in the context of model (2). The first 
part was given by Cook (2007), Proposition 6, but 
here we establish minimality as well. 

Theorem 2.1. Let i?(X) = r^A~^X, and 
let T(X) be any sufficient reduction. Then, under 
model (2), R is a sufficient reduction and R is a 
function ofT. 

3. ESTIMATION UNDER PFC MODEL (2) 
3.1 Maximum Likelihood Estimators 

First we derive the MLE for the parameters of 
model (2) and then show how to linearly transform 
the predictors to yield a sufficient reduction. Our 
derivation requires that S > 0, Sres > and d < 
min(r,p). The full parameter space (/i,5r,/3, A) for 
model (2) has analytic dimension p + d{p — d)+dr + 
p{p + l)/2. We hold d fixed while maximizing over 
the other parameters, and then consider inference 
for d in Section 5. 

The full log likelihood is 

Ld(/2,5r,/3,A) 

= -^log(2^)-(n/2)log|A| 

y 

•A-i(X,-,x-r/3(f,-f)), 



where we have used the centered f^'s without loss 
of generality. For fixed A and F, this log likeli- 
hood is maximized over /x and (3 by the arguments 
/x = X and P = F-^Ppj-^-i^B, where Pr(A-i) ~ 
r(r^A^ir)-i • r^A^^ is the projection onto Sr 
in the A"^ inner product and B = X^F(F^F)-i is 
the coefficient matrix from the multivariate linear 
regression of X^^ on iy. From the form of f3 we see 
that the MLE of r/3 will be the projection of B onto 
Sr in the A~^ inner product. To find Sr and A we 
substitute fi and /3 into the log likelihood to obtain 
the partially maximized form 



^d(5r,A) 

np 



(3) 



log(27r) - (n/2)log|A| 
(n/2)trace{A-i/2sA-i/2 



-P^_V2rA-^/^SfitA"V2}. 

Holding A fixed, this is maximized by choosing 
P^-i/2p to project onto the space spanned by the 
first d eigenvectors of A"^/^5]fit A"^/^. This leads to 
the final partially maximized log likelihood (Cook, 
2007, Section 7.2) 



(4) 



L,(A) = -^log(2^)-^log 



^tr(A-i£,es)-5 E A.(A-iEfit) 
^ ^ i=d+i 



where Aj(A) indicates the ith eigenvalue of the ma- 
trix A. The MLE A of A is then obtained by max- 
imizing (4). The MLEs of the remaining parameters 
are /i = X, Sr = ^Sd{A, Sgt), and p = {f^A~^f)-^ ■ 
r~^A~^B, where F is any orthonormal basis for 
Sr . It follows that the sufficient reduction is of the 
form i?(X) = (vf A-1/2X, . . . , vjA-V2x)^, where 
Vj is the jth eigenvector of A~^/^Sfit A"-"^/^. The 
following theorem shows how to construct A. 

Theorem 3.1. LetV and A= diagfAi, . . . , Ap) 
be the matrices of the ordered eigenvectors and eigen- 
values o/Srcs^^SfitXlrcs^^, and assume that the nonzero 
Xi's are distinct. Then, the maximum of Ld{A) (4) 

over A> is attained at A = Sres + Sres^VKV'^ilres^, 
where K = diag(0, . . . , 0, \d+i, ■ ■ ■ , Ap) . The maximum 
value of the log likelihood is 



(5) 



i. = -f-flog(2») 
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- -log|Srcs| - ^ 2^ log(l + Ai). 
^ ^ i=d+i 

In this theorem, Aj = for i = r + 1, . . . ,p. Conse- 
quently, if r = d then A = Sres; and the last term 
of Ld does not appear. The maximum value of the 
log likelihood can also be expressed in terms of the 
squared sample canonical correlations r?, j = 1, . . . , 
min(p, r), between X and f^: 

Corollary 3.2. 

np np 
^d = -^ ^log(27r) 

min(p,r) 

-|log|£res| + f E log(l-^f)- 

i=d+l 

The following corollary confirms the invariance of 
R under full rank linear transformations of X. 

Corollary 3.3. If A e W^p has full rank, then 

The next corollary gives five equivalent forms for 
the MLE of A'^cSr. 

Corollary 3.4. The following are equivalent 
expressions for the MLE o/ A~^5r under model (2): 

<5d(A,I]) = 5d(Sres,5]) = 5rf(A,5]fit) = SdC^res, 

The first and second forms — 5d(A,S) = Sd x 
(Srosj^]) — indicate that the sufficient reduction un- 
der model (2) can be computed as the principal 
components based on the linearly transformed pre- 
dictors A-V^x 

or Srcs X, as mentioned in the 
Introduction. The remaining forms indicate that a 
sufficient reduction can be computed also as the 
principal fitted components for A^^/'^X, where A is 
A, Srcs or XI . Although A and Srcs make explicit 
use of the response and S does not, the response still 
enters when using S because we regress 5]~^/^Xy on 
iy to obtain the principal fitted components. Any of 
these five form can be used in practice since they 
each give the same estimated subspace, but we tend 
to use ( Srcs ) S fit) for no compelling reason. 

3.2 Robustness 

In this section we consider the robustness of 5^(5], 
5]fit) as an estimator A~^5r under nonnormality of 
the errors and misspecification of the model for Uy. 
Specifically, we still assume that model (1) holds, 



but now with possibly nonnormal errors that are in- 
dependent of Y and have finite moments. The fitted 
model has mean function as given in (2), but we no 
longer assume that Vy = P{iy — E(fy)}. This then 
allows for misspecification of fy. Let p be the d x r 
matrix of correlations between the elements of vy 
and fy. Then, with this understanding: 

Theorem 3.5. 5(^(5], Xlfit) is a ^Jn consistent 
estimator of A"^5r if and only if p has rank d. 

This result indicates that we may still expect 5rf(S, 
Sfit) to be a reasonable estimator when fy is mis- 
specified, provided that it is sufficiently correlated 
with Vy. It also places the present methodology on 
an equal footing with other y/n consistent methods 
that do not explicitly assume normality at the out- 
set. 

While ^/n consistency does not necessarily guar- 
antee good performance in practice, our experiences 
with simulations suggest that it is not difficult to 
choose an adequate fy. To illustrate we generated 
n = 200 observations from model (1) with d = 1, y ~ 
C/(0, 4), Vy = exp(y), p = 20, T = (1, . . . , 1)^/^20 
and A = Ip. This choice for A involves no loss of 
generality because of the invariance property in Corol- 
lary 3.3. Each data set was fitted with d = 1, fy = 
(y, y^, . . . , y^)'^ for A; = 1, . . . , 6 and with fy = exp(y). 
At the suggestion of a referee, we also included the 
lasso regression of 1" on X (Tibshirani, 1996) con- 
structed by using the R library "relaxo" (Meinshausen, 
2006), which selects the tuning parameter by cross 
validation. For each choice of fy we computed with 
d=l the angle between Sfit) and A~^5r. We 

also computed the angle between the lasso coeffi- 
cient vector and A^^cSr- Figure 1 shows boxplots 
of the angles for 100 replications. For reference, the 
expected angle between A~^5r and a randomly cho- 
sen vector in M^o is 

about 80 degrees. The quadratic 
fy shows considerable improvement over the linear 
case, and the results for the 3rd through 6th degree 
are indistinguishable from those for the model used 
to generate the data. 

The lasso performance was better than PFC with 
fy = y, but not otherwise. However, the performance 
of the lasso may improve in sparse regressions with 
relatively few active predictors. To address this pos- 
sibility, we repeated the simulations of Figure 1 after 
setting elements of F to zero and renormalizing to 
length 1. The performance of all PFC estimators was 
essentially the same as those shown in Figure 1. The 
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Fig. 1. Boxplots of the angle between each of eight estima- 
tors and ^^^St- The first boxplot is for the lasso. Boxplots 
2-8 are for the PFC estimators <Sd(S,Sfit) under various 
choices for fy: boxplots 2-7 are labeled according to the last 
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for {y=exp{y). 
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A; = 1, . . . , 6. The last boxplot is 



lasso did improve, but was still noticeably less accu- 
rate than PFC with fy = (y, y^, . . . , y^)'^ and k>2. 
For example, with only five active predictors, the 
median lasso angle was about 41 degrees. Section 7.2 
contains additional discussion of the lasso. 

4. RELATIONSHIPS WITH OTHER 
METHODS 

4.1 Sliced Inverse Regression 

Cook (2007) proved that when Y is categorical 
and is an indicator vector for the Y categories, 
the SIR estimator (Li, 1991) of the central subspace 
is 5^(5], Sfit). Theorem 2.1 and Corollary 3.4 then 
imply that, under model (2) with Y categorical, the 
SIR estimator is the MLE of A"^5r. This and The- 
orem 3.5 help explain many of the empirical find- 
ings on the operating characteristics of SIR. If Y is 
categorical and model (2) is accurate, SIR will in- 
herit optimality properties from general likelihood 
theory. If Y is continuous and slicing is used then 
SIR will provide a ^/n consistent estimator when p, 
the matrix of correlations between the elements of 
vy and the vector fy of slice indicators, has rank d. 
However, in practice SIRs step functions may pro- 
vide only a rough approximation to E(Xy) and, as a 
consequence, can leave useful intra slice information 
behind. While this information might be recovered 
by intra slice fitting (Cook and Ni, 2006), we expect 
that PFC modeling is not as prone to such informa- 
tion loss. 



4.2 Partial and Ordinary Least Squares 

To develop connections between PFC, partial least 
squares (PLS) and ordinary least squares (OLS), 
consider regressing y on X in two steps, assum- 
ing that d= 1. First we reduce X linearly to G"^X 
using some methodology that produces G G R^'^'?, 
q <P- The second step consists of using OLS to fit 
the regression of Y on G^X. Letting /3q denote the 
resulting vector of coefficient for X, we have 

:G(G^£G)-^G^X^Y/n, 



/3c 



where Y is the n x 1 vector of centered responses 
and /3ois is the vector of coefficients from the OLS 
fit of Y on X. This estimator, which is the projection 
of /3oig onto span(G) in the S inner product, does 
not require computation of if q < p and thus 
may be useful when n <p, depending on the size of 

Clearly, if G = Ip then /3q = 13^^^. Consider next 
constructing G from PFC using fy = y. In that case 

span(G) = 5i(I], Sgt) = span(/3ois), and consequently 
using PFC with fy = y to construct G produces the 
OLS estimator. The simulations shown in the box- 
plot of Figure 1 with fy = y then correspond to OLS. 

Let C = cov(X, Y) and C = X^Y/n. Setting G = 
(C, SC, . . . , S''-^C) yields the PLS estimator with 
q factors (Helland, 1990). PLS works best when C 
can be expressed linear combination of q eigen- 
vectors of 5] with unique eigenvalues (Helland and 
Alm0y, 1994), and then span(G) is an estimator of 
the span of those eigenvectors. From these results it 
can be seen that using the isotonic PFC subspace 
G = 5i(Ip, Sfit) with fy = y produces a /3q that is 
equal to the PLS estimator with q = l. Connections 
between PLS and PFC are less clear when q> 1. 

4.3 Seeded Reductions 

As mentioned in the Introduction, there has been 
recent interest in principal components reduc- 
tive method for regressions in which n <p. Isotonic 
PFC applies directly when n < p, as do the meth- 
ods with a structured A discussed in Section 8. The 
PFC estimator with unstructured A > is not di- 
rectly applicable to n < p regressions, but it does 
provide a seed for recently methodology that is ap- 
plicable. Cook, Li and Chiaromonte (2007) devel- 
oped methodology for estimating the central sub- 
space 5y|x inn <p regressions when there is a pop- 
ulation seed matrix cj) G W^'^ such that span((^) = 
S5y|x- It follows from Corollary 3.4 that, in the 
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context of this article, Sy\x = 'Sr = 
5]~^span(Sfit), where Sgt is the population version 
of Sfif Let the columns of be the eigenvectors of 
Sfit corresponding to its d largest eigenvalues. Then 
span(0gf) = S5y|x and </>fit qualifies popula- 
tion seed matrix. The sample version of 0fjt is con- 
structed in the same way using Sgt . At this point the 
methodology of Cook, Li and Chiaromonte (2007) 
applies directly. 

5. CHOICE OF d 

The dimension d of the central subspace was so far 
assumed to be specified. There are at least two ways 
to choose d in practice. The first is based on using 
likelihood ratio statistics = 2(Ljnjn(-j,p) — L^) = 

— nX]™rf+i''^ log(l — rf) to test the null hypothesis 
d = w against the general alternative d> w. Under 
the null hypothesis d = w, has an asymptotic chi- 
square distribution with (r — w){p — w) degrees of 
freedom. This statistic is the same as the usual likeli- 
hood ratio statistic for testing if the last min(p, r) — d 
canonical correlations are in two sets of jointly 
normal random variables (see, e.g., Muirhead, 1982, 
page 567). In the present application the conditional 
random vector is normal, but marginally X and 
fy will typically be nonnormal. The likelihood ratio 
test (LRT) is used sequentially, starting with w = {) 
and estimating d as the first hypothesized value that 
is not rejected. 

The second approach is to use an information cri- 
terion like AIC or BIC. BIC is consistent for d while 
AIC is minimax-rate optimal (Burnham and An- 
derson, 2002). For w G {0, . . . ,min(r,p)}, the dimen- 
sion is selected that minimizes the information cri- 
terion IC{w) = —2Lw + h{n)g{w), where was 
defined in (5), g{w) is the number of parameters 
to be estimated function oi in our case, 

p{p + 2>)/2 + rw + w{p — w) and h{n) is equal to \ogn 
for BIC and 2 for AIC. These versions are simple 
adaptations of the commonly occurring asymptotic 
forms for other models. 

Since the choice of d is essential to the proposed 
methodology, we next discuss selected results from 
a simulation study to demonstrate that reasonable 
inference on d is possible. We first generated Y ~ 
A^(0, cr^), and then with d=2 generated X^ = Vj3iy + 
e, where e ~ iV(0, A), /3 = h, fy = (y, \y\)^ and T = 
(ri,r2) G W"<\ with Ti = (1,1,-1,-1,0,..., 
0)'^/^/4 and Ts = (1, 0, 1, 0, 1, 0, ... , 0)^/V3. For each 
p, A was generated once as A = A'^A, where A is 



a px p matrix of independent standard normal ran- 
dom variables. Let F{2), F(2,3) and F(2,3,4) de- 
note the fractions of simulation runs in which d was 
estimated to be one of the integer arguments. 

Using fy = {y,\y\,y^)'^ when fitting with model 
(2), Figures 2a-2d give the fraction F{2) of runs in 
which the indicated procedure selected d = 2 versus 
n for p = 5, four values of ay and the three methods 
under consideration. The number of repetitions was 
500. Here and in other figures the variation in the 
results for adjacent sample sizes refiects simulation 
error in addition to systematic trends. The relative 
performance of the methods in Figure 2a depends 
on the sample n and signal ay size, and all three 
method improve as n and ay increase. 

In Figure 3 ay = 2 and n = 200. For Figures 3a 
and 3c, model (2) was fitted with iy = (y, \y\,y^)'^, 
while for the other two figures fitting was done with 
iy = {y, \y\,y'^, ■ ■ ■ ,y^^)'^. Figures 3a and 3b show, as 
expected, that the chance of choosing the correct 
value of d decreases with p for all procedures. Fig- 
ures 3c and 3d show that, with increasing p, LRT 
and AIC slightly overestimate d, while BIC under- 
estimates d. In the case of AIC, we estimated nearly 
a 100 percent chance that the estimated d is 2, 3 or 
4 with 80 predictors, r = 10 and 200 observations. A 
little overestimation seems tolerable in the context 
of this article, since then R will still estimate a suf- 
ficient reduction and we can always pursue further 
refinement based on the subsequent forward regres- 
sions. With underestimation R no longer estimates 
a sufficient reduction. While we believe this to be a 
useful practical result, it is possible that the devel- 
opment of Bartlett-type corrections will reduce the 
tendency to overestimate. Based on these and other 
simulations we judged AIC to be the best overall 
method for selecting d, although in the right situa- 
tion either of the other methods may perform bet- 
ter. For instance, comparing the results in Figures 
3a with the results in Figure 2c, we can see that 
for n = 200 and p = 5, the performance of BIC was 
better than AIC. Nevertheless that is reversed after 
p ~ 10 where AIC consistently gave better results. 

6. TESTING PREDICTORS 

In this section we develop tests for hypotheses of 
the form Y _IL X2IX1 where the predictor vector is 
partitioned as X = (Xf ,X^)^ with Xi G and 
X2 G W^, P = Pi +P2- Under this hypothesis, X2 
furnishes no information about the response once 
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Xi is known. The following lemma facilitates the 
development of a likelihood ratio test statistic under 
model (2). In preparation, partition T = (rf ,r^)-^, 

^ = (^ij)) ^rcs = (Sjj^rcs)) ^ = (^ij)) 

and = (A*-'), i = 1,2, j = 1,2, to conform to 
the partitioning of X. Let A~** = (A")~^. For a 
square partitioned matrix A = (Ajj),i,j = 1,2, let 

A.. ._a.._a..a-1a.. 

■^it-j — -^it '''J jj J*' 

Lemma 6.1. Assume model (2) . T/ien y _IL X2IX1 
if and only i/Ts = - A''^'^ A'^^T i . 

The log likelihood for the alternative of depen- 
dence is as given in Theorem 3.1. The following the- 
orem gives the maximum likelihood estimators un- 
der the hypothesis Y _IL X2IX1. 

Theorem 6.2. Assume that T2 = - A"^^ A^^Ti 
and that d<Ti= min(r,pi). Then, the MLE of A is 

given in blocks by An = Yli^^^^^'V{Ip^ +K.)'V'^'S-^^l J.^g, 



with K = diag(0, . . . , 0, A^+i , . . . , Apj ) and V and Ai , 
. . . , Ap^ the ordered eigenvectors and eigenvalues of 

^ll^rcs^ll,fit^ll^res'' '^12 = ■^ll^l/^i2 and A22 = 

%2.i + S2iSn AiiS5"/s;i2. The MLE of the cen- 
tral subspace is span{(S;^/^ggGi, O)-'"}, with Gi the 

first d eigenvectors of Yl^H'^^Ylii fit's i^l'^^. The max- 
imum value of the log likelihood is 

1 np np n ^ 

(6) 

--log|5:22.i|- 2 E log(l + Ai). 

i=d+l 

Under the hypothesis Y _LL X2IX1 the likelihood 
ratio statistic 6^ = 2(Lrf — L^) has an asymptotic 
chi-squared distribution with dp2 degrees of free- 
dom. By following Corollary 3.2 this test statistic 
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can be expressed also as 

Qd = nlog|S22.l| - relog|5]22.1,res| 

min(p,r) 

+ n log(l - rf) 

i=d+l 
min(pi ,r) 

-n ^ log(l-t?), 

i=d+l 

where ti,. . . , tmm(pi,r) f^re the sample canonical cor- 
relations between Xi and iy. 

By using a working dimension w = min(r,pi) when 
constructing the likelihood ratio statistic, we can 
test predictors without first inferring about d, in the 
same way that we set w = min(r,p) when testing hy- 
potheses about the structure of A. In that case, the 
final term of does not appear. 



To study the proposed tests of y _LL X2IX1, we 
generated data from the model X^^ = Ty + e, where 
e ~ N{0, A) and Y ~ iV(0, ay); A was generated as 
in Section 5, and T = c{TJ,TJ)'^ e R^^ with Ti = 
(1, . . . , 1)^ G r2 = - A-22 A^ipi , and c is a nor- 
malizing constant. Predictor testing is best done af- 
ter choice of d, so the fitted model was (2) with 
d = r = 1 and iy = y. Partition X = (X^, Xg')"^ with 
dim(Xi) = 7. Our general conclusion from this and 
other simulations is that the actual and nominal lev- 
els of the test are usefully close, except when the 
sample size n or signal size ay is quite small. For 
instance, the estimated levels of nominal 5 percent 
tests based on 500 runs were 0.18, 0.08, 0.06 and 
0.05 for sample sizes 20, 40, 100 and 120. The test 
tends to reject too frequently for weak signals or 
small samples. We see that there is again a tendency 
for likelihood methods to overestimate, in this case 
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the active predictors. As with inference on d we do 
not judge this to be a serious issue in the context of 
this article. 

7. ILLUSTRATIONS 

We use two small examples in this section to il- 
lustrate the proposed methodology. The first is the 
most thorough, while the second focuses on selected 
aspects of the methodology. 

7.1 Wheat Protein 

We use Fearn's (1983) wheat protein data for this 
illustration. The response is the protein content of 
a sample of ground wheat, and the predictors are 
— log(reflectance) of NIR radiation at p = 6 wave- 
lengths. We chose this example because the predic- 
tors are highly correlated, with pairwise sample cor- 
relations ranging between 0.9118 and 0.9991. Prin- 
cipal components are often considered in such re- 
gressions to mitigate the variance inflation that can 
be caused by coUinearity. 

Plots of each predictor versus Y (not shown) sug- 
gest that E(X|y = y) might be modeled adequately 
as a quadratic function of y and thus fy = (y,y^)"^, 
but for this illustration we decided to allow more 
flexibility and so set fy = {y,y'^ ,y^)'^ . Fitting model 
(2) with this cubic iy resulted in AIC, BIG and LRT 
all choosing d = 1 , suggesting that only one linear 
combination of the predictors Ri is sufficient. The 
plot in Figure 4a of Y versus Ri shows a strong 
linear relation. In contrast, there is no relationship 
evident in the plot of Y versus the first principal 
component shown in Figure 4b. The first four prin- 
cipal components are needed to linearly reconstruct 
Ri. 

Application of the likelihood ratio test of Sec- 
tion 6 with d = 1 to each predictor gave three small 
p- values (3.8 x 10~^^4.6 x 10-^^^2.5 x 10^^) for the 
third, fourth and sixth wavelengths. The p-values 
for the remaining three wavelengths were all greater 
than 0.33. Evidently, not all wavelengths are neces- 
sary for estimating protein content. Predictor selec- 
tion could be continued by using standard likelihood- 
based methods, including backward elimination or 
an information criterion. 

The estimated PFC is = r)^X, where r) = (0.11, 
0.11, -0.84, 0.50, -0.01, 0.12)"^ is normalized to have 
length 1. Although the predictors with the three 
largest absolute coefficients are same as those found 
to be significant, such coefficients are not generally 



useful indicators of the importance of a predictor. 
As in linear regression, the coefficients depend on 
the scales of the predictors. Multiplying the pre- 
dictors by a diagonal matrix to give scaled 
predictors D~^X results in new coefficients Dr) be- 
cause, from Corollary 3.3, the reduction itself is in- 
variant under full-rank linear transformations of X: 
i?i = r)^X = ■jy'^DD^^X. If an informal comparison 
of coefficients is desirable, then it seems necessary to 
at least standardize the predictor by choosing the di- 
agonal elements of D to be the square roots of the 
diagonal elements of S, Sros or A. Use of S seems 
least desirable because it is affected by the signal, 
S = A + r/3var(fy)/3^r^. 

We found no clear indications in this example that 
the errors deviate significantly from normality. How- 
ever, even if the errors in model (2) are not normal 
or ^y is misspecified, we would still expect reasonable 
estimates because of Theorem 3.5. 

We can use these results to gain insights about 
the types of regressions in which principal compo- 
nents might be effective and about why they appar- 
ently fail in this example. Suppose that there are d 
eigenvectors of A that span 5r to a good approx- 
imation. This happens when A = cx^Ip, as in the 
PC model. Then A^^cSr ~ cSr, i?(X) w F^X and 
there is a d x d orthogonal matrix O so that the 
columns of FO are approximately eigenvectors of 
A. It then follows that there are d eigenvectors of 
S that approximately span Sy- If the signal rep- 
resented by /3var(fy)/3"^ is sufficiently strong then 
these should be the ffi'st d eigenvectors of S with 
relatively large eigenvalues. In short, if the signal 
is sufficiently strong and the eigenvectors of A co- 
operate, then S will exhibit collinearity in the di- 
rection of F. The reverse implication does not nec- 
essarily hold, however. As the present illustration 
shows, collinearity in I] does not necessarily imply 
that it has d eigenvectors that span St to a use- 
ful approximation. Additionally, the correlations be- 
tween the errors e estimated from A range between 
0.911 and 0.9993, so the observed collinearity in S is 
coming largely from A and does not reflect a strong 
signal. 

7.2 Naphthalene 

These data consist of 80 observations from an ex- 
periment to study the effects of three process vari- 
ables in the vapor phase oxidation of naphthalene 
(Franklin et al., 1956). The response is the percent- 
age conversion of naphthalene to naphthoquinone. 
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Fig. 4. Wheat protein data: (a) Response versus the first sufficient component; (b) Response versus the first principal 
component. 
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Fig. 5. Naphthalene data: (a) Response versus the first PFC component; (b) Response versus the lasso fitted values. 



and the three predictors are air to naphthalene ratio, 
contact time and bed temperature. Although there 
are only three predictors in this regression, dimen- 
sion reduction may still be useful for visualization, 
as discussed in the Introduction. 

Based on smoothed plots of the predictors versus 
the response, we used = {y,y^)^ to fit the PFC 
model. The three methods for selecting d discussed 
in Section 5 all chose d= 1. Figure 5a gives a plot of 
response versus the first principal fitted component 
Ri. A plot of the response versus the first principal 
components failed to show any useful relationship, 
as in the wheat protein data. We also included in 
Figure 5b a plot of the response versus lasso fitted 
values. A plot of the response versus the PLS fitted 
values with g = 1 is quite similar to that shown in 
Figure 5b. 

The lasso and similar penalization methods are 
designed to fit a single linear combination of the 



predictors while forcing some of the coefficients to 
0, and thereby provide predictor selection along with 
the fit. If d = 1 and the forward linear model is ac- 
curate then the lasso should perform well. In the 
wheat protein data the lasso fitted values are essen- 
tially the same as those from PFC shown in Fig- 
ure 4. If d = 1 and the forward linear model is not 
accurate, then PFC and the lasso can give quite dif- 
ferent summaries of the data, as illustrated in Fig- 
ure 5. Evidently the lasso favors projections of the 
data that have linear mean functions, and tends to 
neglect projections with nonlinear mean functions. 

The are many examples in the literature where the 
dimension of the central subspace was inferred to be 
larger than 1 (see, e.g.. Cook, 1998). As presently 
designed, the lasso cannot respond to such regres- 
sions since it fits a single linear combination of the 
predictors. Similar comments hold for partial least 
squares and other methods that are constrained by 
fitting one linear combination of the predictors. 
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As presently developed, penalization methods like 
the lasso do not address the issues that drive suffi- 
cient dimension reduction. Relatedly, sufficient di- 
mension reduction methods are not designed for au- 
tomated predictor selection per se. Nevertheless, 
there is nothing in principle to prohibit using pe- 
nalization methods within the context of sufficient 
dimension reduction in an effort to gain the best of 
both worlds. One might proceed in the context of 
this article by adding a penalty in A~^5r to the 
partially maximized log likelihood (3). 

8. STRUCTURED A 

We would expect the previous methodology to be 
the most useful in practice. Nevertheless, models be- 
tween the PFC model with A = aHp and the PFC 
model with A > may be useful in some applica- 
tions. In this section we consider models that al- 
low, for example, A to be a diagonal matrix. This 
will result in a rescaling of the predictors prior to 
component computation, although that scaling is 
not the same as the common scaling by marginal 
standard deviations to produce a correlation matrix. 
The models discussed here may involve substantially 
fewer parameters, perhaps resulting in notable effi- 
ciency gains when they are reasonable. 

Following Anderson (1969) and Rogers and 
Young (1977), we consider modeling A with a lin- 
ear structure: A = J2iLi SiGi, where m < p{p+l)/2, 
Gi,...,Gm are known real symmetric p x p lin- 
early independent matrices and the elements oi d = 
((5i, . . . , 5m,)^ are functionally independent. We re- 
quire also that A~^ have the same linear structure 
as A: A~^ = SiGi. To model a diagonal A we 
set Gj = GieJ , where G W contains a 1 in the ith 
position and zeros elsewhere. This basic structure 
can be modified straightforwardly to allow for a di- 
agonal A with sets of equal diagonal elements, and 
for a nondiagonal A with equal off-diagonal entries 
and equal diagonal entries. In the latter case, there 
are only two matrices Gi = Ip and G2 = ee-'", where 
e G has all elements equal to 1 . 

Estimation of the central subspace A~^5r with a 
constrained A follows that of Section 3 up to Theo- 
rem 3.1. The change is that A is now a function of 
S € M™". Thus Lfi{ A((5)} (4) is to be maximized over 
S. In contrast to the case with a general A, here were 
unable to find a closed-form solution to the maxi- 
mization problem, but any of the standard nonlin- 
ear optimization methods should be sufficient to find 
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Fig. 6. Tests of a diagonal A; The x-axis represents sample 
size and the y-axis represents the fraction P of time the null 
hypothesis is not rejected. 

argmax^ L{ A(5)} numerically. We have used an al- 
gorithm (Appendix B) to solve dL{A{S)}/dd = 
iteratively. The starting point is the value that max- 
imizes Lii when r = d since then the maximum can 
be found explicitly. The resulting estimator^ of the 
central subspace can be described as 5d(A,5]fit), 
where A is the MLE of the constrained A, but 
Corollary 3.4 no longer holds. 

A model with constrained A can be tested against 
(2) by using a likelihood ratio test: under the con- 
strained model Orf = 2{L(i — Ld(A)} is distributed 
asymptotically as a chi-squared random variable with 
p{p +l)/2 — m degrees of freedom. This test requires 
that d be specified first, but in practice it may be 
useful to infer about A prior to inference about d. 
This can be accomplished with some loss of power 
by overfitting the conditional mean and using the 



statistic 



min(r,p) ) 



which has the same asymptotic 



null distribution as 0^. 

To confirm our asymptotic calculations, we gener- 
ated data from the simulation model Xj^ = Ty + £, 
with Y ~ iV(0, 1), r = (1, . . . , 1)^/Vp e and s ~ 
A^(0,A), where A is a diagonal matrix with en- 
try (i,i) equal to 10*~^. For the fitted model we 
used the working dimension w = r, since inference 
on A will likely be made prior to inference on d, and 
= (y, . . . , y^)'^ . Testing was done at the 5 percent 
level and the number of repetitions was 500. Figure 
6 gives graphs of the fraction of runs in which the 
null hypothesis was not rejected versus sample size 
for various values of r and p = 6. These and other 
simulation results show that the test performs as ex- 
pected when n is large relative to p. As indicated in 
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Figure 6, our simulation results indicate that, with d 
fixed, the sample size needed to obtain good agree- 
ment between the nominal and actual levels of the 
test increases with r. 

9. DISCUSSION 

The methods proposed in this article provide likeli- 
hood-based solutions to the two long-standing prob- 
lems that have hounded principal components, es- 
tablish a likelihood-based connection between prin- 
cipal fitted components and model-free sufficient di- 
mension reduction and provide insights about the 
types of regressions in which principal components 
might be useful. When model (2) is accurate, the 
methodology will inherit optimality properties from 
general likelihood theory, while otherwise providing 
^/n consistent estimators under relatively weak con- 
ditions. Additionally, there are no restrictions on the 
nature of the response, which may be continuous, 
categorical or even multivariate. Perhaps the main 
limitations are that var(X|y) must be constant or 
approximately so, and the methods are not designed 
for discrete or categorical predictors. Investigations 
into extensions that address these limitations are in 
progress (Cook and Forzani, 2009). 

APPENDIX A: PROOFS OF THE RESULTS 
FROM SECTIONS 3 AND 6 

Proof of Theorem 2.1. The condition y|X ~ 
Y\T holds if and only if X|(r, Y) ~ X|r. Thus, think- 
ing of Y as the parameter and X as the data, T 
can be regarded sufficient statistic for X|y. 

The conclusion will follow if we can show that R 
is a minimal sufficient statistic for X|y. Note that 
in this treatment the actual unknown parameters 
(/^,5r,/3. A) play no essential role. 

Let g{x\y) denote the conditional density of X| (y = 
y). To show that i? is a minimal sufficient statistic 
for X|y it is sufficient to consider the log likelihood 
ratio 

^ogg{z\y)/g{^\y) 

= -(l/2)z^A-^z + (l/2)x^A-^x 

+ (z-xf A^Vr 

If log g{z\y) / g{x.\y) is to be a constant in y then we 
must have log c/(z|y)/c/(x|?/) = E{logff(z|y)/5-(x|y)} 
for all y. Equivalently, we must have (z — x)'^A~^ • 
(Py — fly) = 0- Let r G MP^'^ be a basis for span(/x — 



/Xy). Then the condition can be expressed equiva- 
lently as (z — x)"^r/3fy = 0, and the conclusion fol- 
lows. □ 

Let denote the set oi q x q positive definite 
matrices. 

Proof of Theorem 3.1. We use / as a generic 
function whose definition changes and is given in 
context. We will make a series of changes of variables 

^1/2 _i ^ 1/2 

to rewrite the problem. Let U = Srcs A Srcs so 
that maximizing (4) is equivalent to maximizing 

/(U) = log|U|-tr(U) 

- ^ A,(U£-V2£fit£res/')- 

i=d+l 

Let T = min(r,p) and use the singular value decom- 
position to write 5]res'^^5]fit5]rcs^^ = VAt-V"^ where 
V € MP^"^ is an orthogonal matrix and At- = diag(Ai, 
. . . ,Xr,0, . . . ,0), with Ai > A2 > • • • > > 0. Calling 
H = V^UV G S+, (7) becomes 

T 

(8) /(H)=log|H|-tr(H)- J] Xi(HAr). 

i=d+l 

We now partition H as H = (Hjj)jj=i^25 with Hn G 
§+, H22 G Sp_r [foi' P = T we take H = Hn and go 
directly to (9)]. Consider the transformation to 
the space §+ x §+_^ x M^x(p-^) given by Vn = Hn, 

V22 = H22 - Hf2Hr/Hi2 and V12 = Hr/Hi2. This 
transformation is one to one and onto (Eaton, 1983, 
Proposition 5.8). As a function of Vn, V22 and 
V12, (8) can be written as 

log|Vii||V22| - tr(Vii) - tr(V22) 

T 

- tr(Vf2VnVi2) - ^^(ViiA.), 

i=d+l 

where A,- = diag(Ai, . . . , At), and we have used the 
fact that the nonzero eigenvalues of HA,- are the 
same as those of HuAt. The term - tr(Vf2ViiVi2) 
is the only one that depends on V12. Since Vn 
is positive definite, Vj"2ViiVi2 is positive semidefi- 
nite. Thus, the maximum occurs when V12 = 0. This 
implies that H12 = 0, Hn = Vn, H22 = V22, and 
we next need to maximize 

/(Hn,H22) 

= log|Hn| +log|H22| 

r 

- tr(Hn) - tr(H22) - ^ H^ln^r)- 

i=d+l 
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This function is maximized over H22 at H22 = Ip-r; 
then we need to maximize 

/(Hii) = log|Hn|-tr(Hn) 



(9) 



^ Ai(HnA,) 

i=d+l 



Since the trace and the eigenvalues are cychc oper- 
ations, 

tr(A-iE,es) = tr(A-i/2s,esA-i/2) 
= tr{V(Ip + K)-iV^} 



(12) 



~ 1 /2 ~ 1 /2 

Letting Z = Ar HuAr leads us to maximize 
/(Z) = log |Z| - tr(ZA-i) -ELd+i Ai(Z). Since Z G 
S;^, there exists an F = diag(/i, ...,/,-) with /j > 
in decreasing order and an orthogonal matrix W in 
W'' such that Z = W^FW. As a function of W 
and F, we can rewrite the function / as 

r 

/(F,W)=log|F|-tr(W^FWA^i)- ^ fi 

i=d-\-l 

T 

= log|F| - tr(FWA7iW^) - ^ fi. 

i=d+l 

Now, using a lemma from Anderson (1971), Theo- 
rem A.4.7, minw tr(FWA-iW^) = J2i=i fiK^, and 
if the diagonal element of F and A,- are distinct, 
the minimum occur when W = I^. Knowing this, 
we can rewrite the problem one last time, as that of 
maximizing in (/i, . . . , fr), all greater than zero, the 
function 



■ d+ l/(l+A,) + (p-r), 



i=d+l 



^ A,(A-iSfit)= E A.{V(I + K)-i 

i=d+l i=d+l 

T 

(13) = J2 A,{V(I + K)-iV^VKV^} 

i=d+l 

= x: M(i+K)-^K} 

i=d+l 

_ ^ Aj 
i=d+i 1 + Aj 
Since Sres > we have 

log I A| = log |£j/s'V(I + K)Y^±Us\ 



(14) 



(10) 



/(/i,...,/r) 



log I Sres 1+ ^ log(l + Ai 

i=d+l 



i=l i=l i=d+l 



Clearly the maximum will occur at fi = Aj for i = 
1, . . . ,d and for i = d+ 1, . . . , r, /j = Xi/{Xi + 1). Since 
Aj are positive and decreasing order, fi are positive 
and decreasing in order. Since all the Aj are differ- 
ent, the fi are different. Collecting all the results, 
the value of A that maximizes (4) is 



0. 



rx(p-r) )V^gl/2 



^res \ n • 

\ '^{p—t)xt ^p—tx{p—t) 

where Ay^Z^^A^^ = diag(Id, A^+i + 1, . . . , A^ + 1). 

Now, to obtained the maximum value we replace 
A by A in (4), 



LdiA) 
(11) 



np 



log(2^) 



n . 



■loglAI 



-tr(A' 
2 ^ 



^ E A.(A-iEfiO. 

i=d+l 



Plugging (12), (14) and (15) into (12) we obtain (5). 

□ 

Proof of Corollary 3.2. The eigenvalues % 

of Sros^ SfjtSres^ are the Same as those of Sj.gg Sgt . 
These eigenvalues are related to the eigenvalues rf 
of £-^£fit by 1 + Ai = (1 - r2)-i (Cook, 2007, Ap- 
pendix 7). Now the eigenvalues of 5]~^^fit are the 
same as those of 

£-V2gfi,g-i/2 = s-V2(x^F/n)(F^F/7^)-i 
• (F^X/n)S-V2, 

where X"^F/n is the p x r matrix of sample corre- 
lations between X and f and F^F/n is the sample 
covariance matrix of f . □ 

Proof of Corollary 3.3. RecaU from Theo- 
rem 3.1 that A = Srcs + SrVs^VKV^ErVs^ where V 

^ —1/2 ^ ^—1/2 

contains the eigenvectors of B = Srcs Sfit5]rcs • 
The transformation X — > AX transforms B — > OBO"^, 

where O = {A'S^-csA.'^)"^^'^ A.'sUs is an orthogonal 
matrix. Consequently, under the transformation K 
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is invariant, V ^ OV and A AAA"^. The rest 
of the proof follows similarly. □ 

To prove Corollary 3.4 we need a lemma. 

Lemma A.l. Let\ = Y.reJ'^YM^/'^ , where M = 
(Ip + K)-^ with V and K as in Theorem 3.1. 

— 1/2 - 

Then A V are the normalized eigenvectors of 
A-i/2gfit X A-V2. 



Proof. From Theorem 3.1, 



1/2 



Then, A-l = llrel'^ilp + K)-'V''T,rcs 

= Eres^^VMY^Sres^^ Using the fact that V are 

^ 1/2^ ^ 1/2 

the eigenvectors of Srcs '^fit^rcs we get 

= E-^/^VMA^M^/^ ^ VM^/2^^M^/2 
= VMA^, 

where MA^ = diag(Ai, . . . , A^, Arf+i/(A<i+i + 1),..., 
A^/(A^ + 1),0,...,0). Therefore A-^Sfit has eigen- 
values Ai , . . . , Ad, Ad+i/ (Ad+i + 1), • • -j^t/ (Ar + 1), 
0, . . . , with eigenvectors V, and V"^ AV = Ip. □ 

Proof of Corollary 3.4. From the develop- 
ment leading to Theorem 3.1, the MLE of 
span(A~^r) is ^^(^j^fit)) which establishes the 
first form. Now, from Lemma A.l, span of the first d 
columns of A-V2;^V2v = V is the MLE for 
span(A-ir). Since V = £r"cs''^VMV2 ^j. 
agonal full rank with the first d elements equal 1, the 
span of the first d columns of V is the same of the 

^ — 1/2 ^ ^ 

first d columns of 5]rcs V where V are the eigenvec- 
tors of Sres^'^SfitSrcs^^- This prove the fourth form. 
The proof of the fifth form can be found in Cook 
(2007) and it follows from the fact that the eigen- 
vectors of S~^I]fit and S~s5]fit are identically, with 
corresponding eigenvalues Aj and Ai/(1 — Aj). The 
corollary follows now from the relation between the 
eigenvectors of the product of the symmetric matri- 
ces AB and the eigenvectors of A}^'^'QA^^'^ . The sec- 
ond and the third forms follow from the fourth and 
fifth forms and from the fact that S = Srcs + Sfit. 
□ 

Proof of Theorem 3.5. It is sufficient to con- 
sider the limiting behavior of S~^Sfit, because 



Sd{%%it)^= £-i/2spand(^-V2^fi,S-i/2) = 
spand(5]~^5]fit), where spand(A) indicates the span 
of the first d eigenvectors of A. 

The following statements on large sample behav- 
ior follow the general line that Li (1991), Section 5, 
used in his demonstration of \/n consistency for SIR. 
Since I] is the marginal sample covariance matrix 
of X, its asymptotic behavior depends only on the 
true model. It is know that under the stated assump- 
tions S is a ^Jn consistent estimator of SI = rVF"^ -|- 
A, where V = var(i/y) > 0. Consequently, Xl"^ is 
a i/n consistent estimator of Next, as given 

in Section 2.1, %t = (X'^F/n)(F^F/n)-i(F'^X/n) 
which converges at ^/n rate to Sgt = cov(X, f )cov(X, f)"^, 
where we have assumed var(fy) = 1^ without loss of 
generality. Using model (1) for X we have, cov(X, f ) = 
rC, where C = cov(i/'y, fy). Consequently, S^-'^Sgt 
converges at ^Jn rate to 5]~^5]gt = (rVT"^ -|- 
A)-irCC'^r-\ and the first d eigenvectors of S^^Sfit 
converge at the -y/n rate to corresponding eigenvec- 
tors of S-^Sfit. 

Using the form S'^ = A'^ - A-ir(V-i + • 
^-ip^)-ij-T^-i simplifying we find ^^"^Slfit = 
A-^rKCC^r^, where K = {\-^ + ^'^Vy^Y-^ 
is a full rank dx d matrix. Clearly, span(S~^Sfit) C 
A^^5r with equality if and only if the rank of FKC • 
(-iTpT -g gq^g^} ^_ Since V has full column rank 
and K is nonsingular, the rank of TKCC-^r^ is 
equal to d if and only if the rank of CC"^ is equal 
to d. The result follows since p = V-i/^c, recalling 
that we have assumed var(fy) = 1^. □ 

Proof of Lemma 6.1. y _IL X2IX1 if and only 
if y _IL X|Xi. Suppose that Y _U_ X|Xi. We know 
that r"^A~^X is the minimal sufficient reduction 
and thus it should not depend on X2. Now, 



(15) 



r^A-^x = (rf An + rl^A2i)Xi 



+ (rfAi2 + r^A22)X2 



will not depend on X2 if and only if A^^ + Fg A 
equivalently r2 = — A~22^2i-p^^ "YYie reciprocal 
follows directly if we replace r2 by — A~22^2ip^ 
on equation (15). □ 

Proof of Theorem 6.2. After maximizing the 
log likelihood over (/x,/3), we need to maximize on V 
and A-i /(r,A-i) = log|A-i| - tr(A-iS) + 
tr(A~"'^Pp^^-ij5]fit). From the hypotheses on F, we 
have F^A^i = (Ff Ai^-2,0) where A^^-^ = A" - 
^12^-22^21^ Then F^A-^F = Ff A^^'^Fi. For fixed 



A 22 
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A, the last term is maximized by choosing (A^^'^)"*^/^ • 
Fi to be a basis for the span the first d eigenvectors 
of (Ai^-2)V2g^^ ^.^(^11.2)1/2^ yielding another par- 
tially maximized log likelihood 

/(A-i) = log|A-i|-MA-iS) 

(16) 



From Theorem 3.1 the MLE for L^^/ is ^\{^^^^'V{ld + 



^ ^ "1/2 " " 

K)V'^S/i ,,3, where K and V are as defined in The- 

it 
The 



orem 6.2. Since L 



11 



A12a 



-22 A 21 



follows that A 



11 



11-^ ^ ^ -"11 



MLE for A22 is S^l^ and for A^^ _s-/Si2S^2^p 



_^^_Xi{(A^^-2)^/2Siifit(A^^-2)^/^}. Consequently, A12 = AnS^^ S12 and A22 = S;22.i + 



i=l 

Let us take the one-to-one and onto transformation 
defined by Ln = A^^ - A^^A-^Sa^i, L22 = 
and L12 = A-'^^A"^^. As a function of Ln, L22, L12 
we get 

/(Lll,L22,Li2) 

= log|Lii| + log|L22| 

— tr(L225]22 + L22Lf25^12) 

— tr{(Lii + Li2L22L-f2)5]ii 



+ Li2L225]2l} 

d 

^Aj(L-|^^ Sii fitL 



11 



i=l 



Now, differentiating with respect to L12 in the last 
expression, we get that 

df 



9Li2 



-2XI12L22 — 2SI11L12L22 and 

-2i;ii L22. 



Therefore the maximum occurs when L12 = — '^u ■ 
5] 12- Replacing this in the last log likelihood func- 
tion we next need to maximize 



/(Lii,L: 



22 J 



log|Lii| +log|L22| 

- tr(L22S22) - tr(LiiSii) 
+ tr(L225]i25];^/5]i2) 



+ ^A,(L}fSn,fitL 



11 7' 



i=l 



since for L12 = -'S^^'Su, -2tr(L22Lf25]i2) - 
tr(Li2L22Lf25]ii) = tr(L225]i25^r/5]i2). The max- 
imum on L22 is at L22 = "^221 that we need to 
maximize on Ln 

/(Lii) = log|Lii|- tr(LiiSii) 



+ X! Ai(Lj(^Sii,fitL{(''). 



1/2. 



£21^1/ All S-|^-^^Si2. 
The MLE for the span of A-^F = (A^-^F^O)'^ 

is the span of (A^^j^^^^Fi, 0)-^ with Fi the first d 



eigenvectors of Sn gt A^j^^'^. Using the logic 

of Corollary 3.3 it can be proved that the MLE of 

span(A~^F) is in this case the span of (Si/^^gFi, 0)"^, 

" " 1/2 ^ ^ 1/2 

with Fi the first d eigenvectors of 'S^i j^j,gl]ii^fit5];^^ (,gg. 

The proof of (6) can be done essentially as the 

proof of (5). □ 

APPENDIX B: ALGORITHM FOR A WITH 
LINEAR STRUCTURE 

We will maximize (4) function / of S = A 
We first find the derivative with respect to S without 
considering any structure. Because Sfn is symmetric 
we get 



1/2 



dm 

dS 



S-i-S,es- v.(SSfit)uf(SEfit)5]fit, 

i=d+l 

where Uj and Vj indicate respectively the normal- 
ized right and left eigenvectors corresponding to the 
eigenvalue Aj of SSfif Now, dS/dsh = Gh and 

df{S) 



(17) 



ti{S-'Gh) - tr(S,esGO 

r 

- tr{Sfitu,(SSfit)vf(SSfit)G4. 

i=d+l 

Denote by Uj the ith eigenvector of S^/^SlfitS^/'^ = 
A~^/^Sfit A~^/^ corresponding to the Aj eigenvalue 
(in decreasing order) normalized with unit norm. 
Then Ui = S^/^Q, = D'^^^.^ ^ S'^/^Ui = D^/^Ui, 
Sg^-Uj = AjD^/^Uj and vj"uj = if i / j and 1 oth- 
erwise. We can rewrite (17) as 

df{S) 



(18) 



dsh 



tr(AGO - tr(SresG^) 
- Y A.tr(Ai/2a,ufAi/2G,). 

i=d+l 



i=l 



To find the MLE we need to solve df{S)/dsh = 
for h = 1, ... ,171. Using (18) we can rewrite df{S)/dsh 
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using the vec operator as 

vec(Gh)'^vec(A) 
(19) =vec(Gfe)^vec(£res) 

+ yec{Ghf E A.vec(Ai/2ti,ufAi/2) 

i=d+l 

for h = 1,. . . , m. Let G = {vec(Gi), . . . , vec(Gm)}. 
Since A = J2iLiSiGi, vec(A) = G6 we can 
rewrite (19) for all h as 

G^G5 = G^|vec(Srcs) 

(20) 

+ A,vec(AV2u,ufAV2)l. 

i=d+l } 

Now, if r = d we get 5 = (G^G)-^G'^vec(Srcs) and 
= (EZiSiGi)-^. The algorithm will be: 

1. Set 6o = {61 ...,<) = (G^G)~iG^vec(£re.). 

2. Compute Aq = E^i 6°Gi and So = Aq\ 

3. Compute until convergence, n = 1, 2, . . . , 

A„ = (G^G)-iG^ 
• vec(5]i.cs) 

+ ± Af"-vec{A:V2^uf"-(uf"-)-A:^,} 

j=c(+l 

with Sn-i = A~\ and A^"-i and u^"-i denoting 
respectively the eigenvalues and eigenvectors of 

^1/2 ^ ^1/2 
*n-l^fit»ri-l- 
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